Final Report for Contract N00014-85-K-0328

An  Analysis  of  Multiple  Grain  CHiP  Architectures 

Lawrence  Snyder 

Department  of  Computer  Science,  FR-35 
University  of  Washington 
Seattle,  Washington  98195 


1  Scope  of  Report 


This  contract,  originally  planned  to  cover  three  years,  was  concluded  effective  September 
1,  1986  with  the  majority  of  the  scientific  questions  resolved;  only  one  year  of  funding  was 
actually  appropriated.  This  report  recaps  the  scientific  accomplishments  during  the  actual 
period  of  the  contract. 


2  Multigauge  Architecture 


The  concepts  and  issues  covered  by  this  contract  were  in  the  beginning  poorly  understood 
and  quite  amorphous.  This  makes  for  very  interesting  and  exciting  scientific  work,  but  it 
also  implies  that  there  may  be  false  starts.  The  first  and  probably  the  only  noticeable  one 
for  us  was  in  naming  the  concept:  “Multigrain”,  though  accurate  and  perspicuous  for  some 
scientists,  was  vague  and  ambiguous  to  others.  The  issue  is  that  the  concept  of  “granularity” 
is  used  many  ways  in  parallel  computation,  and  the  term  “multigrain”  confused  the  matter 
further by adding one more. Accordingly, we created a new term, “multigauge”, which not
only distinguished the concept from existing ones but also built on a useful analogy with
railroads. “Multigauge” has been well received and seems to be gaining adoption. In any event
we have used it throughout our work since early 1985, and will do so for this report.


2.1 Review of the Concept


A  multigauge  architecture  is  a  design  for  a  standard  von  Neumann  machine  that  permits  the 
data  path  to  be  partitioned  into  subunits  that  can  execute  concurrently  when  the  data  values 
are “small”. Thus it is a method of achieving parallelism within the arithmetic/logic unit
of  a  computer  that  is  applicable  to  both  stand  alone  serial  processors  and  to  the  processor 
elements  of  highly  parallel  computers.  Its  advantages  are  the  twin  benefits  of  parallelism  for 
narrow  data  and  the  ability  to  use  the  processor  in  its  “regular  mode”  under  programmer 
control when full precision is necessary. There is the added benefit that multigauging is
available with minimal added hardware in certain cases (explained below), but this is simply
a benefit, not the justification for the architecture.

There is certain terminology and notation surrounding the multigauge concept that will
be useful in the subsequent discussion. A standard von Neumann machine and a multigauge
machine executing at full precision are said to be wide gauge machines with a datapath of
B bits. A multigauge machine is partitioned into k narrow gauge machines, each executing
a datapath of b bits; B ≥ k × b, and equality is generally assumed. Not all possible widths
are necessarily available with every multigauge machine, so a multigauge machine is often
described by listing the gauges with which it can execute; for example, a (32, 16, 8) multigauge
machine can be partitioned into two or four concurrent processors.
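
The gauge notation can be made concrete with a small sketch. It assumes equality in B = k × b, and the function name is purely illustrative rather than anything defined in the cited reports.

```python
def narrow_machine_counts(gauges):
    """Map each gauge width b to the number k of concurrent machines, k = B // b.

    Assumes the widest listed gauge is the wide gauge B and that B = k * b exactly.
    """
    wide = max(gauges)  # B, the wide gauge datapath width
    counts = {}
    for b in gauges:
        if wide % b != 0:
            raise ValueError(f"gauge {b} does not evenly divide {wide}")
        counts[b] = wide // b
    return counts


print(narrow_machine_counts((32, 16, 8)))
```

For the (32, 16, 8) machine this yields one 32-bit machine, two 16-bit machines, or four 8-bit machines.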

A multigauge machine can execute in either SIMD or MIMD mode. In the former, there
is a single instruction stream and each narrow gauge machine has its own data stream. The
MIMD case provides for an instruction stream and a data stream for each narrow gauge
machine. Because it would be ludicrous to have an MIMD bit-serial machine, we postulate
the MIMD threshold, the minimum value of b at which it is “reasonable” to run a multigauge
machine in MIMD mode; we take b = 8, though it may be larger.

These concepts and related work have been described in the literature [1]; a notable
omission in the cited references is the work on TRAC, the Texas Reconfigurable Array Computer
[4]; this machine was a parallel processor capable of a certain level of multigauge processing.

2.2  Multigauging  the  Quarter  Horse 

To make our study of multigauging concrete, we applied the concept to a specific
microprocessor architecture, the Quarter Horse [5]. This single chip, 32-bit microprocessor,
fabricated in 3 μ CMOS technology, was designed (in 90 days) at the University of Washington. It
was  thus  a  natural  choice  for  multigauging  experiments,  because  we  understood  it  at  the 
lowest  levels  of  detail.  Moreover,  single  chip  microprocessors  present  serious  problems  for 
multigauging  because  of  the  limited  “pin  out”,  so  the  challenge  was  even  greater.  Before 
describing  the  impact  of  multigauging  on  the  Quarter  Horse,  we  give  a  very  brief  description 
of  the  architecture. 

The  Quarter  Horse  [5]  has  a  32-bit  wide,  dual-bus  datapath,  32  general  purpose  registers, 
a  Mead-Conway  [6]  ALU  with  Manchester  carry-chain,  a  barrel  shifter  and  PLA  controller. 
The  typical  instruction  requires  6  “microinstructions”  and  the  design  specification  called  for 
a 75 ns clock cycle. The machine uses a self-incrementing program counter (PC). It also
uses  32  pins  each  for  connecting  the  memory  address  register  (MAR)  and  the  memory  data 
register  (MDR)  to  the  memory  system. 

2.2.1  Partitioning  the  Components 

Perhaps the first transformation that comes to mind when multigauging a microprocessor,
and probably the least complicated, is partitioning the datapath components to
execute independently. In this section we review the changes required for the Quarter Horse,
and comment on the difficulty of partitioning elements not found in the Quarter Horse.

The  register  file  is  trivially  partitioned  for  normal  multigauge  execution,  since  each  bit 
is  independent.  Nothing  has  to  be  done.  If,  however,  more  exotic  instructions  are  provided, 
then some additional circuitry may be necessary [3]. Neither the MAR nor the MDR requires
any change.

The  Mead-Conway  ALU  requires  modest  modification.  For  all  but  one  of  the  narrow 
gauge  datapaths,  carry-in  logic  must  be  added.  Symmetrically,  all  but  one  narrow  gauge 
machine  needs  flag  logic,  if  it  is  to  be  executed  in  MIMD  mode.  (For  SIMD  execution  we 
assume  only  one  of  the  narrow  gauge  processors  can  influence  the  control  (branch),  and  so 
only  one  machine  needs  flag  logic;  this  might  as  well  be  the  wide  gauge  machine’s  flag  logic.) 
Finally,  the  carry  chain  must  be  segmented  between  each  narrow  gauge  machine,  but  because 
the  design  of  the  Manchester  carry  chain  provides  drivers  every  four  bits,  this  change  is  very 
straightforward. 
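
The effect of segmenting the carry chain can be illustrated in software. The following is a minimal sketch, not a model of the Quarter Horse circuit: it assumes that carries are simply discarded at each narrow gauge boundary, and the function name and default widths are illustrative.

```python
def partitioned_add(x, y, wide_bits=32, gauge_bits=8):
    """Add two wide words as independent narrow gauge lanes.

    Carries are dropped at each lane boundary, mimicking a carry chain
    that is segmented between narrow gauge machines.
    """
    if wide_bits % gauge_bits != 0:
        raise ValueError("gauge must evenly divide the wide datapath")
    lane_mask = (1 << gauge_bits) - 1
    result = 0
    for shift in range(0, wide_bits, gauge_bits):
        a = (x >> shift) & lane_mask
        b = (y >> shift) & lane_mask
        # Keep only the low gauge_bits of the lane sum; the carry-out is discarded.
        result |= ((a + b) & lane_mask) << shift
    return result


for gauge in (8, 16, 32):
    print(gauge, hex(partitioned_add(0x00FFFFFF, 0x00000001, 32, gauge)))
```

The same pair of operands gives a different result at each gauge, because the carry out of the low lane stops at a different boundary each time.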

The PC need not be split for SIMD execution; for MIMD execution, the problem is
analogous to splitting the ALU.

To partition the barrel shifter requires a rather surprising modification based on the
following fact: a barrel shifter b bits in width requires a height of O(b) bits. Thus,
partitioning the shifter in half requires a shifter only half the height, and so each of the two
narrow gauge shifters requires only a quarter of the area of the wide gauge shifter. Other narrow
gauges take correspondingly less area. This phenomenon, which is based on the O(n^2) growth
rate of the area of barrel shifters, and which is predicted by the theory [6], implies that both
vertical and horizontal wires must be segmented at each “narrow gauge boundary” in each
dimension. However, the definition of boundary is different in each dimension [7].
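
The area argument can be written out as a short worked equation. It assumes an idealized area model A(b) = c b^2 for a barrel shifter of width b, where the constant c is a hypothetical technology-dependent factor not given in the report.

```latex
A(b) = c\,b^{2}, \qquad
A\!\left(\tfrac{B}{2}\right) = c\,\tfrac{B^{2}}{4} = \tfrac{1}{4}\,A(B), \qquad
k \cdot A\!\left(\tfrac{B}{k}\right) = k \cdot \tfrac{c\,B^{2}}{k^{2}} = \tfrac{1}{k}\,A(B).
```

Under this model each of the two half-width shifters occupies a quarter of the wide gauge area, and the k narrow gauge shifters together occupy only 1/k of it.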

Other typical datapath components, though not found in the Quarter Horse, tend to fall
into one of the three classes already identified: “bitwise independent” components requiring
no change, e.g. a complementer; “linear” components requiring straightforward segmentation,
e.g. a counter; and “quadratic” components requiring two-dimensional segmentation, e.g. a
multiplier.

2.2.2  Memory  System  Partitioning 

When a wide gauge machine is partitioned into narrow gauge machines, the size of the data
address can be dramatically reduced. If this means that the size of the addressable memory
is also reduced, then multigauging will not be a desirable way to exploit parallelism. (Recall
that dividing the address in half reduces the address space by dividing its log in half; e.g. a
32-bit address refers to over four billion locations, but a 16-bit address has roughly
sixty-five thousand.) Although it would not quite solve this addressing problem, adding more
wires between the processor and the memory would seem to offer the right “fix.” But that is
ruled out because single chip processors do not have enough “pins” on
the “package” (footnote 1). Obviously, the von Neumann bottleneck gets a lot narrower in the
presence of multigauging.

Fortunately, there is a rather straightforward solution that sacrifices very little. The
technique is to provide segment registers on the “memory side” of the memory-processor
interface  for  each  narrow  gauge  machine.  Each  register  contains  the  high  order  bits  of  the 
narrow  gauge  machine’s  address,  and  for  the  “larger”  gauges,  e.g.  8  bits  or  greater,  provides 
adequate  addressing  capability  in  each  segment.  Special  instructions  are  provided  to  load 
the  segment  registers,  which  may  require  more  than  one  narrow  gauge  operation.  If  the 
machine  is  to  be  executed  in  MIMD  mode,  it  is  advisable  to  have  both  an  instruction  stream 
and  a  data  stream  segment  register  for  each  multigauge  device. 
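
As a rough illustration of the segment register idea, the sketch below forms a full memory address by concatenating a segment register value with a narrow gauge address. The function name, the 16-bit default gauge, and the simple concatenation rule are assumptions made for illustration; the report does not fix these details.

```python
def full_address(segment, offset, gauge_bits=16):
    """Concatenate a memory-side segment register with a narrow gauge address.

    The segment register supplies the high-order bits; the narrow gauge
    machine supplies only the low gauge_bits bits.
    """
    if not 0 <= offset < (1 << gauge_bits):
        raise ValueError("offset does not fit in the narrow gauge")
    return (segment << gauge_bits) | offset


print(hex(full_address(0x0001, 0x0002)))
```

Only the offset crosses the processor-memory interface on an ordinary reference, so no pins are added.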

As an aid to addressing when the segments are “smallish”, it is advisable to
provide an “add with carry to memory” instruction for computing a sequential address. The
idea  is  that  when  the  segment  addresses  are  computed,  there  should  be  a  smooth  mechanism 
for  crossing  the  segment  boundary,  or  at  least  recognizing  when  it  has  been  crossed.  This 
will  add  flag  logic  to  the  SIMD  case,  but  it  is  our  estimate  that  it  is  more  effective  than 
depending  exclusively  on  the  software  to  manage  segment  boundaries. 
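
One way to picture the “add with carry to memory” idea is sketched below. This is an assumption-laden reading, not the instruction's actual definition: the carry out of the narrow gauge adder is added to the segment value and also reported as a flag, and only non-negative strides are considered.

```python
def step_address(segment, offset, stride=1, gauge_bits=16):
    """Advance a segmented address; return (segment, offset, crossed).

    'crossed' plays the role of the flag that signals a segment boundary crossing.
    Assumes stride >= 0.
    """
    total = offset + stride
    carry = total >> gauge_bits              # carry out of the narrow gauge adder
    new_offset = total & ((1 << gauge_bits) - 1)
    return segment + carry, new_offset, carry != 0


print(step_address(3, 0xFFFF))
```

Software then needs to act only when the flag is raised, rather than testing for a boundary on every address computation.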

2.2.3  Control 

Perhaps  the  most  difficult  problem  in  multigauging  is  the  control  of  the  various  machine 
forms  [2].  To  simplify  the  matter,  we  consider  SIMD  and  MIMD  separately. 

The ideal situation, essentially available only in the SIMD case, is to use a single controller
for the wide gauge and all of the narrow gauge machines. This requires that each machine
have  essentially  the  same  instruction  set.  The  obvious  exceptions  are  the  gauge  shifting 
instructions  which  only  apply  in  one  gauge,  and  control  instructions  which  have  somewhat 
different  semantics  in  different  gauges.  A  single  controller  is  feasible  because  the  control  lines 
generally  run  “perpendicular”  to  the  datapath. 

MIMD  multigauge  machines  have  multiple  program  counters  and  thus  execute  different 
instructions  at  the  same  time.  The  control  implication  of  this  fact  is  that  different  control 
lines  must  be  set  for  each  narrow  gauge  machine.  This  probably  means  that  there  are  separate 
controllers  for  each  narrow  gauge  machine.  It  would  be  possible  in  the  microprogrammed  case 
to  use  a  multiported  control  ROM  with  separate  microPCs  for  each  narrow  gauge  machine; 
the  problem  is  that  a  careful  design  would  be  needed  because  microcode  sequencing  is  much 
more  complex  than  normal  program  sequencing. 

The final observation about MIMD control concerns the fact that the instruction
streams and the data streams should have their own segment registers for each narrow
gauge machine, and that in MIMD execution the narrow gauge machines can be executing
different instructions at any given moment. This means that one machine might be fetching
an instruction and another might be fetching data; the memory system must be told which
segment register to use in each case. This can be done using additional control signals
between the processor and memory, but if the prohibition against additional wires between
the processor and memory is in force, then we will be required to observe a protocol in
memory reference. The rigid reference ordering can be implemented independently on each
side of the von Neumann bottleneck and thus no additional wires are needed. Use of the
protocol is no real problem, because many RISC machines (the kind for which the prohibition
can be expected to apply) use a regimen in which memory references follow a rigid protocol.

Footnote 1: Exotic packaging helps but doesn’t solve the problem, so we face the problem head-on.
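
The idea that a rigid reference ordering removes the need for extra control wires can be sketched as follows. The particular ordering (an instruction fetch followed by a data reference, repeated) and the class name are hypothetical; the report does not specify the protocol.

```python
from itertools import cycle

# Assumed fixed per-machine ordering of memory references (hypothetical).
REFERENCE_ORDER = ("instruction", "data")


class MemorySide:
    """Tracks, without extra wires, which segment register each reference uses."""

    def __init__(self, num_machines):
        # One independent position in the ordering per narrow gauge machine.
        self._order = [cycle(REFERENCE_ORDER) for _ in range(num_machines)]

    def segment_kind(self, machine):
        """Return which segment register applies to this machine's next reference."""
        return next(self._order[machine])


memory = MemorySide(num_machines=2)
print(memory.segment_kind(0), memory.segment_kind(0), memory.segment_kind(1))
```

Because the processor follows the same ordering, both sides of the interface agree on the segment register without exchanging any signal.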

2.3  Applications 

The  possibility  of  creating  a  multigauge  machine  can  be  seen  rather  clearly  from  the  foregoing 
remarks.  What  is  not  entirely  evident  is  whether  there  is  any  beneficial  speedup  from 
multigauging.  The  matter  is  complicated  by  the  fact  that  certain  multigauge  machines  can 
be  implemented  with  little  or  no  additional  hardware,  but  these  are  the  rigid,  SIMD  machines 
with  only  one  or  a  few  narrow  gauge  sizes.  It  is  not  clear  how  useful  such  a  machine  might 
be.  On  the  other  hand,  a  full,  MIMD  multigauge  machine  with  many  gauge  sizes  would  be 
much  more  useful  and  require  a  lot  more  hardware.  This  hardware  may  be  used  better  for 
other  purposes.  Finally,  multigauge  machines  are  subject  to  Amdahl’s  law,  i.e.  a  substantial 
amount  of  any  given  computation  may  require  the  wide  gauge  and  thus  yield  no  speedup. 
In order to determine which of these possibilities was truth and which was conjecture, we
analyzed some specific applications [3]. The results were impressive.
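
The Amdahl's law concern can be quantified with the usual formula. The sketch assumes that a fraction f of the computation runs on k narrow gauge machines at once while the remainder needs the wide gauge; the function name is illustrative.

```python
def multigauge_speedup(narrow_fraction, k):
    """Amdahl-style bound: only the narrow gauge fraction runs k-wide."""
    if not 0.0 <= narrow_fraction <= 1.0 or k < 1:
        raise ValueError("need 0 <= narrow_fraction <= 1 and k >= 1")
    return 1.0 / ((1.0 - narrow_fraction) + narrow_fraction / k)


for f in (0.5, 0.9, 1.0):
    print(f, round(multigauge_speedup(f, 2), 3))
```

A speedup close to k is therefore possible only when nearly all of the work fits in the narrow gauge.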

Two  applications  have  been  studied  in  depth  in  order  to  ascertain  the  amount  of  speedup 
that can actually be achieved. The applications both come from graphics: Bezier curve
generation and scan conversion using Bresenham’s algorithm. Since no multigauge machine
has  actually  been  built,  it  has  been  impossible  to  obtain  true  measurements.  Nevertheless, 
our  methodology  is  sufficiently  detailed  that  we  can  confidently  report  the  findings  as  the 
next  best  thing  to  actual  measurements. 

The methodology is to seek problems of practical interest that compute with “narrow”
data;  for  graphics  we  consulted  Professor  Tony  DeRose.  Next  the  problems  were  programmed 
in  C  and  tested  on  realistic  data  (on  a  VAX)  to  produce  actual  dynamic  behavior.  Then,  the 
C  source  programs  were  compiled  for  the  Quarter  Horse  microprocessor  whose  object  code 
can  be  put  into  one-to-one  correspondence  with  the  test  program  code.  The  difference  is,  of 
course,  that  the  “narrow”  gauge  instructions  can  be  executed  several  at  a  time.  Then  the 
machine  instructions  of  the  Quarter  Horse  object  program  were  expanded  to  their  microcode 
equivalents.  Each  of  these  microinstructions  has  a  fixed  duration  determined  by  the  clock 
rate  of  the  computer.  Consequently,  we  can  describe  the  behavior  of  the  machine  in  a 
way  that  can  be  reported  either  as  a  unit  free  speedup  (the  ratio  of  the  base  machine’s 
performance  over  the  multigauge  machine’s  performance)  or  in  absolute  seconds. 
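
The final step of the methodology, turning microinstruction counts into a speedup or into absolute time, might look like the sketch below. It assumes that each microinstruction takes one 75 ns clock cycle, and the counts are hypothetical placeholders, not measurements from the study.

```python
CLOCK_NS = 75               # design-specification clock cycle quoted for the Quarter Horse
MICRO_PER_INSTRUCTION = 6   # "typical" microinstructions per instruction (an approximation)


def run_time_ns(microinstructions, clock_ns=CLOCK_NS):
    """Absolute run time, assuming one clock cycle per microinstruction."""
    return microinstructions * clock_ns


def speedup(base_microinstructions, multigauge_microinstructions):
    """Unit-free speedup: base machine time over multigauge machine time."""
    return run_time_ns(base_microinstructions) / run_time_ns(multigauge_microinstructions)


# Hypothetical instruction counts, for illustration only.
base = 1000 * MICRO_PER_INSTRUCTION
multi = 520 * MICRO_PER_INSTRUCTION
print(run_time_ns(base), "ns vs", run_time_ns(multi), "ns")
print("speedup", round(speedup(base, multi), 3))
```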

The results are striking [3]. For a Bezier curve Q(u) of degree n defined by

    Q(u) = \sum_{i=0}^{n} V_i B_i^n(u),    u \in [0, 1],    (1)

where V_0, ..., V_n are controlling points commonly called control vertices, and
B_0^n(u), ..., B_n^n(u) are the nth degree Bezier blending functions defined by

    B_i^n(u) = \binom{n}{i} u^i (1 - u)^{n-i},

we can perform the computation for a 1024 x 1024 pixel display (i.e. 10 bits precision on
output) with b = 16 bits of precision internally, which is required to control error propagation.
Thus a k = 2 multigauge machine can achieve speedups q for n points in the range [64, \infty)

    q \in [1.937, 1.939).

Of course, for k = 2 a speedup of 2 is the theoretical best. It is important to observe that
the Bezier curve evaluation must change gauge for each generated point, an overhead many
algorithms wouldn’t have, and still it approaches optimal speedup.
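
For reference, a Bezier curve can be evaluated directly from its blending functions, B_i^n(u) = C(n, i) u^i (1 - u)^(n - i). The sketch below uses ordinary floating point rather than the 16-bit arithmetic analyzed in the study, and the function names are illustrative.

```python
from math import comb


def bernstein(n, i, u):
    """nth degree Bezier blending function B_i^n(u)."""
    return comb(n, i) * u**i * (1 - u) ** (n - i)


def bezier_point(control_vertices, u):
    """Evaluate Q(u) as the blend of 2-D control vertices V_0..V_n."""
    n = len(control_vertices) - 1
    x = sum(bernstein(n, i, u) * vx for i, (vx, _) in enumerate(control_vertices))
    y = sum(bernstein(n, i, u) * vy for i, (_, vy) in enumerate(control_vertices))
    return x, y


vertices = [(0, 0), (1, 2), (2, 0)]
print([bezier_point(vertices, u) for u in (0, 0.5, 1)])
```

The x and y sums are independent narrow computations of the same form, which is the kind of work a k = 2 machine can perform side by side.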

To  get  an  idea  of  how  multigauging  would  apply  to  a  broad  class  of  graphics  problems, 
we  studied  line  segment  transformations  and  scan  conversion.  This  is  a  “typical”  graphics 
problem  in  that  it  uses  a  4  x  4  homogeneous  matrix  to  transform  points,  a  very  common 
‘subroutine’. 

Bresenham’s algorithm was performed by a k = 2 multigauge machine that generates
points for a line “from both ends towards the middle”, a scheme that led to a novel
architectural feature, the virtual register [3]. Using the kind of analysis mentioned above, we
find

q  =  1.70  for  a  single  line, 

q  =  1.99  for  50  lines,  and 

q  =  1.9995  for  1000  lines. 

The  latter  case,  of  course,  is  most  typical  for  graphics  displays  of  “interesting”  objects. 
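
The “from both ends towards the middle” scheme can be sketched in software. The following is an illustrative reading only: two independent integer Bresenham walks each generate half of the pixels, one starting at each endpoint. The function names are hypothetical, and the virtual register feature is not modeled.

```python
def bresenham_prefix(x0, y0, x1, y1, count):
    """Return the first `count` pixels of the integer Bresenham line from (x0, y0)."""
    dx, dy = abs(x1 - x0), abs(y1 - y0)
    sx = 1 if x1 >= x0 else -1
    sy = 1 if y1 >= y0 else -1
    steep = dy > dx
    if steep:
        dx, dy = dy, dx          # dx is now the major-axis extent
    err = 2 * dy - dx
    x, y = x0, y0
    points = []
    for _ in range(min(count, dx + 1)):
        points.append((x, y))
        if err > 0:              # step along the minor axis
            if steep:
                x += sx
            else:
                y += sy
            err -= 2 * dx
        if steep:                # always step along the major axis
            y += sy
        else:
            x += sx
        err += 2 * dy
    return points


def two_ended_line(x0, y0, x1, y1):
    """Generate a line with two walks, one from each end, meeting in the middle."""
    total = max(abs(x1 - x0), abs(y1 - y0)) + 1
    front = (total + 1) // 2
    head = bresenham_prefix(x0, y0, x1, y1, front)          # narrow machine 0
    tail = bresenham_prefix(x1, y1, x0, y0, total - front)  # narrow machine 1
    return head + tail[::-1]


print(two_ended_line(0, 0, 4, 2))
```

Each walk touches only its own half of the line, so the two halves can proceed concurrently on separate narrow gauge machines.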

A variety of other topics were treated in addition to the multigauge work, as outlined in the
original proposal. In this section we detail the results of the most significant ones.

3.1  Poker 

The Poker Parallel Programming Environment is the first complete set of programming
support facilities developed expressly for parallel computation. Developed by the Blue CHiP
project in research conducted continuously since January 1982, Poker provides a novel
setting for programming nonshared memory parallel computers. Because the concepts are
completely new and differ substantially from all previous programming languages, the amount of
research that has been expended on Poker has been large. Nevertheless, concepts in Poker
apply directly to the application of multigauging in parallel architectures, and thus Poker
research was an important adjunct to the multigauge research. We review the specific studies
supported by this contract.

3.1.1  Initial  Distribution  of  Poker 

As just indicated, Poker contains many novel ideas that were motivated by the realization
that parallel programming is substantially harder than serial programming and that
programmers will require greater support from their programming environment if they are to
use parallel computers productively. Like any piece of research, some of the ideas in Poker
are fundamental and important, some are misguided and wrong-headed, and most are in
between. These are the ones that need scrutiny by the research community, but the only way
for its members to get a clear picture of the system is to use it. Accordingly, we spent
considerable time preparing Poker for distribution, and in October of 1985 Poker was released
to the research community [8]. There has been a steady flow of requests for the system ever since.

3.1.2  Retargeting  Poker 

Poker was originally developed to program the CHiP architectures generally and the Pringle
Parallel Computer (a hardware simulator of CHiP architectures) specifically. As a result
many of its features were specialized to the CHiP architecture. When it became clear
that Poker could be a far more general facility than just a programming language for the
CHiP Computer, and that it could be used for any of the known parallel architectures, it
became critical to study two problems. First, how could the features specialized to the CHiP
architecture be removed and replaced with features with wider applicability? Second, how
could Poker be restructured to generate programs for arbitrary parallel machines, i.e. how
could Poker be retargeted? Solutions to both problems have been developed under this
contract.

Recognizing Poker features that are peculiar to the CHiP architecture was easier than
replacing them with alternatives. Examples of CHiP-specific features are (refer to the Poker
description [8]): the switches (circles) in the switch setting view, the use of the XX
programming language for specifying processor element codes, the geometric rather than the
topological layout of the interconnection graph, the convention that the number of processors
be a power of 4, etc. The switches were removed simply by not drawing them; they still play
a valuable role that would require a similar technique if they were not available, so it was
deemed reasonable to just keep them in place but hidden. The XX programming language
remains in the system, but the ability to use C as a processor element programming language
was added. Nothing was done about the geometric emphasis of the communication graph;
it was just too fundamental a part of the system and is not too great an impediment to
others. Clearly, any subsequent reimplementation would remove this geometric emphasis in
favor of a more topological style. The “power of four” problem is simply ignored since this
is not a serious restriction (most parallel computers have a similar character) and the logical
processors used by Poker can always be ignored. Other CHiP-specifics have been handled
using similar “mixed” strategies.

The real challenge to the Poker research team was the retargeting of Poker to other
parallel architectures [9]. Retargeting, or porting, sequential software is still a difficult task
a quarter of a century after the problem was first appreciated; it is the primary reason
computer manufacturers retain extinct instruction sets for their computers: they need
the existing software base to continue to run. The problem is immensely more difficult in
parallel computation because the architectures “show through” into the programs far more
than in serial computation. Of course, there is no base of parallel
software, but if it is to be developed, the retargeting problem must be solved.

A major accomplishment of this contract was the retargeting of Poker to a new
machine [9]. To be specific, the restructuring of Poker so that it can compile code
for different machines was accomplished under the current contract, and the actual porting of
Poker to the Caltech Cosmic Cube was begun as a proof of concept. (The port was
completed under another contract.) The three major requirements to enable Poker to generate
code for another parallel computer are (1) the development of a C compiler for the processor
element codes, (2) the development of a runtime environment for the Poker programs on
the host machine, and (3) the construction of a “mapper” that converts the communication
structure used in Poker into an appropriate communication structure for the host machine.
The first requirement is almost always met by the computer manufacturer by industry
tradition, and the second can likely build effectively on the available software provided by the
manufacturer. For the Cosmic Cube, these three constituents required approximately 3
man-months to get up and running.

3.1.3  Hearts:  Poker  for  Systolic  Arrays 

The Poker Parallel Programming Environment is a general facility for programming
arbitrary MIMD parallel computers. It is common to use systolic arrays as subroutines when
engaged in such programming, but doing so is somewhat cumbersome because of the
generality of Poker. The question arises whether Poker could be made more expressive if it were
limited to programming only systolic arrays. This is a reasonable restriction since there are
many systolic array users and so such a “restricted” system could still find wide applicability.
We have considered the question [10,11] and found that such a system would not only be
feasible, but it would likely be very convenient.


3.2 Parallel Programming Paradigms

There are many programming paradigms known for serial programming (divide-and-conquer,
greedy, dynamic programming, etc.), but because parallel programming is so new,
there are few “standard” problem solving methods known. A graduate student, Phil Nelson,
has studied this topic for his doctoral research in an effort to identify a set of parallel
programming paradigms. He has produced a list including systolic and pipelined algorithms,
divide-and-conquer, and a new class called the CAB class, for compute, aggregate and
broadcast.
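
The report only names the CAB class, so the following is a loose sketch of what one compute, aggregate and broadcast round might look like when simulated serially. The function name and the interpretation of the three phases are assumptions, not Nelson's definition.

```python
def cab_round(local_values, compute, aggregate):
    """Simulate one compute-aggregate-broadcast round over a list of processes."""
    partials = [compute(v) for v in local_values]   # compute: each process works locally
    combined = aggregate(partials)                  # aggregate: combine into one value
    return [combined] * len(local_values)           # broadcast: every process receives it


print(cab_round([1, 2, 3], lambda v: v * v, sum))
```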

One reason to study paradigms is as an aid to discovering new algorithms: a technique
that worked well once might be applied in another context with good results. Using this
approach, Nelson has discovered a new matrix multiplication algorithm that is based on the
divide-and-conquer paradigm [12]. This new algorithm uses the same principles as Strassen’s
fast serial matrix multiplication algorithm in that it subdivides the matrix product
problem until it reduces to a large set of 2 x 2 matrix products. It should be noted that Nelson’s
algorithm doesn’t eliminate any multiplication operations and so is not subject to the
instability problems that Strassen’s algorithm is. (This result was reported at the SIAM meeting
in Norfolk, VA.)
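
The report does not give Nelson's algorithm, so the sketch below shows only the generic divide-and-conquer idea it builds on: a matrix product is split recursively into block products until 2 x 2 products remain, with all eight block multiplications kept. It assumes square matrices whose size is a power of two and at least 2; the function names are illustrative.

```python
def mat_add(a, b):
    """Elementwise sum of two equally sized matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def split(m):
    """Split a square matrix into its four quadrants."""
    h = len(m) // 2
    return ([r[:h] for r in m[:h]], [r[h:] for r in m[:h]],
            [r[:h] for r in m[h:]], [r[h:] for r in m[h:]])


def block_multiply(a, b):
    """Recursive block product; no multiplications are eliminated."""
    if len(a) == 2:
        return [[a[0][0] * b[0][0] + a[0][1] * b[1][0],
                 a[0][0] * b[0][1] + a[0][1] * b[1][1]],
                [a[1][0] * b[0][0] + a[1][1] * b[1][0],
                 a[1][0] * b[0][1] + a[1][1] * b[1][1]]]
    a11, a12, a21, a22 = split(a)
    b11, b12, b21, b22 = split(b)
    # The eight block products are independent and could run in parallel.
    c11 = mat_add(block_multiply(a11, b11), block_multiply(a12, b21))
    c12 = mat_add(block_multiply(a11, b12), block_multiply(a12, b22))
    c21 = mat_add(block_multiply(a21, b11), block_multiply(a22, b21))
    c22 = mat_add(block_multiply(a21, b12), block_multiply(a22, b22))
    top = [left + right for left, right in zip(c11, c12)]
    bottom = [left + right for left, right in zip(c21, c22)]
    return top + bottom


print(block_multiply([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```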

Another  reason  to  study  paradigms  is  that  transformations  that  one  wishes  to  apply  to 
an  algorithm  might  well  be  applicable  to  all  of  the  algorithms  in  its  paradigmatic  class. 
Contraction is one such operation which has been studied in this way [13]. Recall that
contraction is required when a parallel program has many more parallel processes than the
parallel computer has parallel processors, and so multiple processes must be allocated to
each processor. It has been shown that algorithms of the CAB paradigm can be contracted
using the same allocation scheme [13].
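
To make contraction concrete, the sketch below assigns processes to processors in contiguous, balanced blocks. This block allocation is one hypothetical scheme chosen for illustration; the report does not say which allocation scheme the cited work uses.

```python
def contract(num_processes, num_processors):
    """Assign process ids 0..P-1 to processors in contiguous, balanced blocks."""
    if num_processors < 1:
        raise ValueError("need at least one processor")
    base, extra = divmod(num_processes, num_processors)
    allocation, start = [], 0
    for p in range(num_processors):
        size = base + (1 if p < extra else 0)   # the first `extra` processors take one more
        allocation.append(list(range(start, start + size)))
        start += size
    return allocation


print(contract(8, 4))
```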


4  Conclusions 


It is evident from the research summarized in this report that multigauging is an effective
scheme for realizing parallel speedup of processors when the data being processed is “small”.
The best results come for the many problems that admit an SIMD execution policy, since the
SIMD multigauging capability can be added to a standard computer or parallel computer
processing element with little hardware cost. In this case multigauging provides essentially
free speedup. The MIMD case is considerably more expensive in terms of hardware and its
potential applications turned out to be unexpectedly rare. It can therefore be recommended
that an SIMD multigauge machine be constructed along the lines discussed in the reports